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ORTHOGONAL BASES FOR VERTEX-MAPPED PYRAMIDS 


JESSE CHAN* AND T. WARBURTON* 

Abstract. Discontinuous Galerkin (DG) methods discretized under the method of lines must handle the inverse of a block 
diagonal mass matrix at each time step. Efficient implementations of the DG method hinge upon inexpensive and low-memory 
techniques for the inversion of each dense mass matrix block. We propose an efficient time-explicit DG method on meshes of 
pyramidal elements based on the construction of a semi-nodal high order basis, which is orthogonal for a class of transformations 
of the reference pyramid, despite the non-affine nature of the mapping. We give numerical results confirming both expected 
convergence rates and discuss efficiency of DG methods under such a basis. 


1. Introduction. Mesh generation has not yet matured to the point where hexahedra-only meshes 
can be constructed for complex geometries. Despite this limitation, hexahedral elements remain popular, 
offering significant benefits over triangular and tetrahedral elements in high order finite element methods. For 
example, exploitation of the tensor-product structure allows for both simple constructions of basis functions 
and cubature rules, as well as fast, low-memory applications of high order operators. An alternative to purely 
hexahedral meshes are hex-dominant meshes [1, 23], which contain primarily hexahedral elements but also a 
small number of tetrahedral, wedge (prism) and pyramid elements, where wedge and pyramid elements are 
used as “glue” elements to facilitate connections between hexahedral and tetrahedral elements [9, 3, 11, 15]. 

Finite elements for the pyramid have been available since the early 1990s [2, 28], though a rigorous 
construction of high order bases for the pyramid has been a more recent development. Nigam and Phillips 
constructed conforming exact sequence finite element spaces in [21, 22], and Bergot, Cohen, and Durufle 
gave explicit orthogonal bases on the pyramid [4, 3]. Both groups showed that, in addition to polynomials, 
the approximation spaces on the pyramid must contain rational functions in order for the trace spaces on 
the faces of the pyramid to remain polynomial, which is necessary for conformity of the global finite element 
space. 


1.1. Techniques for efficient mass matrix inversion. In [16], it was shown that the computational 
structure of DG methods makes them well-suited for accelerators such as graphics processing units (GPUs). 
Under time-explicit DG methods, a block diagonal mass matrix inverse is accounted for at each timestep. 
A key observation for straight-edged simplicial elements is that each block of the mass matrix is a constant 
scaling of the mass matrix over a reference simplex. As a result, it is possible to sidestep the inversion of 
the full mass matrix by using derivative and lift operators which are premultiplied by the inverse of the 
reference mass matrix and applying local scalings. This sidesteps the storage and inversion of individual 
mass matrices over each element, which, due to the limited memory and reduced efficiency of general linear 
algebra routines on GPUs, is not expected to perform well on such accelerators. 

Finite element methods typically define coordinate mappings from a reference to physical element using 
basis functions on the pyramid. Entries of the mass matrix are then computed on the reference element using 
a change of variables factor. For affine mappings of the reference simplex, this factor is constant, implying 
that only one mass matrix needs to be stored and inverted for all such simplices. For trilinear mapped tensor 
product hexahedral elements, this factor is no longer constant, but it is possible to decompose the mapped 
mass matrix into the Kronecker product of ID mass matrices such that this factor is constant in each tensor 
product direction. An alternative procedure is to employ Lagrange polynomials at Gauss-Legendre-Lobatto 
(GLL) quadrature points and construct the lumped mass matrix through inexact numerical integration. This 
yields the Spectral Element Method (SEM), which boasts a trivially invertible diagonal mass matrix whose 
entries are the GLL quadrature weights. 

Bedrosian introduced in [2] low order vertex shape functions for the pyramid which are rational in 
nature. Using such shape functions, transformations of the reference pyramid could be defined in terms of 
vertex positions of the physical pyramid. We consider in this work physical pyramids which are images of 
the reference pyramid under such a map, and refer to these as vertex-mapped pyramids, which are analogous 
to affine mappings of the simplex and trilinear mappings of the hexahedra. 
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For vertex-mapped pyramids, however, we do not observe the advantages of either simplicial or hexahe- 
dral elements. The construction of lumped mass matrices and GLL quadratures for non-hexahedral elements 
is nontrivial [20], and the tensor product structure is absent for the pyramid. An analogue to GLL points 
may not even exist for non-hexahedral domains — Helenbrook showed that, on triangles, there does not 
exist a Lobatto-type quadrature rule which is both exact for polynomials of order 2 N — 1 and has a number 
of points equal to the dimension of order N polynomials [13]. Furthermore, in addition to the fact that 
a non-planar pyramid base produces a non-affine mapping, it was shown in [3] that for non-parallelogram 
pyramid bases, the mapping is not only non-affine, but rational in the r, s, t coordinates. 

Attempts to rectify the costs and complications of non-affine mapped elements have also been proposed 
previously in the context of curvilinear meshes. Several methods have experimented with modifying the 
numerical method or formulation in order to sidestep difficulties in dealing with curvilinear and non-affine 
transformations. For example, Krivodonova and Berger [17] extrapolate boundary conditions on curvilinear 
boundaries to the boundary of a mesh consisting of afhne-mapped triangles. However, while this technique 
allows for the efficient inversion of mass matrices on simplices where the determinant of the Jacobian is 
constant, it does not circumvent the presence of non-affine mappings for pyramids. 

Warburton proposed in [24, 25] a Low-Storage Curvilinear discontinuous Galerkin method (LSC-DG), 
where the local basis functions on each element are taken to be the reference element basis functions divided 
by the square root of the change of variables factor over that element. As a result, the mass matrix is identical 
to the reference element mass matrix for all elements, independent of the local mapping. The analysis in 
[25] includes a convergence analysis with sufficient conditions requiring elements to be asymptotically affine 
to attain design order convergence. These conditions do not hold for general vertex-mapped pyramids, and 
it was observed in [4] that the LSC-DG error stagnated under refinement of pyramidal meshes . 

We present here an alternative low-memory DG method by constructing a basis which yields a diagonal 
mass matrix for arbitrary vertex-mapped pyramids, but spans the same space as the optimal pyramid spaces 
described in [3] and [21]. The resulting DG method on vertex-mapped pyramidal meshes provides both 
efficient inversion of the mass matrix and optimal rates of convergence for high order approximation spaces. 
Numerical results confirm the accuracy and efficiency of this basis compared to LSC-DG and matrix-free 
alternatives, and computational experiments are performed to assess the performance of DG on GPUs. 

2. High order finite elements on the pyramid. We introduce the bi-unit right pyramid V with 
coordinates r, s, t such that 


r,se[-l,-t], ie[-l,l]- 

We also define the Duffy-type mapping from the bi-unit cube to the bi-unit right pyramid with coordinates 
a, 5, c e [—1,1] 


r = (1 + a) 


1 — c 


— 1, s — (1 + 5) 


1 — c 


— 1, t = c. 


which has a change of variables factor of ((1 


— c) /2) 2 . The inverse transform is given by 


a = 


2(1+ r) 
1 -t 


b = 


2(1 + 5 ) 
1 -t 


c — t. 


Quadrature rules for the pyramid may also be constructed by defining a quadrature rule on the bi-unit cube 
and applying the transform to the reference element. 

The vertex functions of Bedrosian [2] are defined as follows on the bi-unit right pyramid: 


vi{r,s,t) 


V 4 (r,s,t) 


(r + t)(s +1) 

2(1 — t) ’ 

(r + l)(s + 1) 

2(1-i) 


v 2 (r,s,t) = - 


(r + t)(s + 1) 

2(1 - 0 ’ 


v 5 {r,s,t) = 


1 + t 


v 3 {r,s,t) = - 


(r + t)(s +1) 

2(1 -i) 
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Fig. 2.1. The reference bi-unit right pyramid (left) and an example of a vertex-mapped pyramid (right). 


The mapping (x,y,z) = F(r,s,t) from the reference pyramid V to the physical vertex-mapped pyramid V 
is then given by 


5 

F(r, ,,*) = £ ViVi(r,s, t), 


where Vi is the coordinate of the ith vertex of the physical pyramid. We also define J, the determinant of 
the Jacobian of F, such that 


J = 


dF x 

OF* 

dF^ 

dr 

ds 

dt 

dFy 

3F\l 

d_F± 

dr 

ds 

dt 

dF z 

dF z 

dF\ 

dr 

ds 

dt 


/ udxdy dz = / 

Jv J-p 


uJ dr ds d t. 


Bergot, Cohen, and Durufle [3] defined an orthonormal basis on the bi-unit right pyramid as follows: let 
P^ ,/3 be the Jacobi polynomial with weights a , /3. Then, define 'ipijk 


/ 1 _ \ k'ij 

ipiAa, b, c) = V¥^Pf\a)P°’\b) l-^\ Pf y+2 (c), 


( 2 . 1 ) 


where jiij = max(i, j) and k <N — fiij. Under the Duffy-type mapping from (a, 6, c) to (r, 8, t) coordinates, 
these ipiji c form an ortho normal basis over the reference bi-unit right pyramid. 

The elements of the mass matrix for the mapped pyramid V are defined as 


Mi 


ijk,i'j'k' 




^Pijk^Pi' j'k' J d-X — 


/:/:/: 


'Ipijk'lp 1 , 




J da db dc. 


For the orthonormal basis of Bergot, Cohen, and Durufle, the mass matrix is no longer diagonal under a non- 
affine mapping, and inversion of the mass matrix must be done individually over every element. However, 
while it is difficult to define an orthonormal basis for an arbitrary non-affine map, it is possible to derive an 
orthogonal basis for a vertex-mapped transformation of the reference pyramid. 

2.1. An orthonormal semi-nodal basis on the mapped element. We first restate Lemma 3.5 of 
Bergot, Cohen, and Durufle [3], which gives that the determinant of the Jacobian J is bilinear when mapped 
under the inverse Duffy-type transform to the bi-unit cube. 
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Lemma 2.1. Let QN a ,N b ,N c be the space of polynomials of individual orders N a , and N c in the a,b,c 
coordinates on the bi-unit cube, and let J be the determinant of the Jacobian mapping. Then, J E Qi,i,o for 
vertex-mapped transformations of the pyramid. 

We may use this fact, along with the fact that the 7V + 1 point Gauss-Legendre quadrature rule integrates 
exactly polynomials of degree 27V + 1, to construct an orthonormal basis for the vertex-mapped pyramid (we 
will refer to this as the semi-nodal basis). To begin, we first show a property of Jacobi polynomials with 
varying weights. 

Lemma 2.2. For i ^ j, 



2 +(N-i)+(N-j) 


p2N-\-3 — 2i,0 ^^p2N-\-3 —2j,0 


(c) dc = C™ Sij , 


where 


1 2 2 ( Ar+1_ d(2A7 + 3 — 2i) ’ 

Proof. Assume without loss of generality that j < i and that TV, i > 0 (since Po is trivially determined for 
any choice of TV). The statement of the Lemma is then equivalent to showing 



2+(N-i)+(N-j) 


P -N +3 - 2i ’°(c)p j ( C ) dc = 0 


for any polynomial Pj(c) of degree j. Since j < i , we may take 

'1 -c x j 


pA c ) = 


j = i — 1 — k 


for i > k > 0. Then, 


2+2 (N-i)+(N-j) 


r l X 2N+3-2i 

/ 1 — C \ 


,2AT+3-2i,0 


(c) 


1 — C 




2AT+3-2i,0 


(c) 


1 — C 


dc 


dc = 0 , 


due to the weighted orthogonality of Jacobi polynomials to polynomials of lower order. Finally, when i = j , 
we may compute 


a N = 


/:w 


2+27V—2i 


(^+3-2i,0 (c) y 


dc = 


N + 2 


2 2(iv+i-q(27v + 3 — 2i) ’ 


□ 


A similar property was also exploited by Beuchler and Schoberl in [5] to construct basis functions for 
the triangle with sparse stiffness matrices. These polynomials are shown in Figure 2.1 for N = 3. Note also 
that a change of index from i, j to (N — i), (N — j) in the above proof gives 



2 +i+j 


p2i+3,0 
r N-i 


(c)p 2 J-F 


(c) dc = Cft-Aj, 


n N — 


7V + 2 

2 2i + 2 (27 + 3) ’ 


We may now construct a semi-nodal basis which is orthogonal on vertex-mapped pyramids by relying on the 
fact that the determinant of the Jacobian is bilinear in a, b constant in c. 

Lemma 2.3. Let a^b^ denote (k + 1 )-point Gauss-Legendre quadrature points with corresponding weights 
w\,w k p Let V be a vertex-mapped pyramid, and let <fijk be defined on the bi-unit cube as 

<Pijk{a, b, c) = £}(a)£$(b) P&*_ + * 3 (c), 
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Fig. 2.2. Polynomials p 2iV + 3 2z (c) for N = 3, normalized by their value 


at c = 1. 


where is the order k Lagrange polynomial which is zero at all but the ith (k + 1) Gauss-Legendre nodes, 
and Pl-k (c) is the Jacobi polynomial of degree k with order-dependent weight 2k + 3. Then, the f>ijk are 
orthogonal with respect to the L 2 inner product over V, and the entries of the mass matrix are 

Mijk,ijk Jijk'tVi'tV jC_ k 

where Jijk is the determinant of the Jacobian evaluated at quadrature points a\,b k . 

Proof. Assume without loss of generality that k > k'. By Lemma 2.1, the tensor product of (k + 1)-point 
Gauss-Legendre quadrature rules integrates exactly 

// Jt\ (a)4' (a)t) (b)£f, (6) da d b. 


The entries of the mass matrix are then 


Mi 


— J^&ijk&i'j'k'J — J J J (fijk&i'j'k' 


ijk,i'j'k' — / Ytj kYi'j 
V 

k k 


1 — C 


= E E ^w^(a^;(a^(bi)^(b m )J lmk 


1=0 rn =o 

x 

k k 




1 — c 


2+k+k' 


d c 


= EE Jlrnk6u6 ii >6 jm 6j J >W?W k m C%_ k 6 kkl 


1=0 m =0 


= Jlmkfiii'fijj'WlWmCN-kfikk'- 


J da d b d c 


where the integral over c yields C^_ k Skk' by Lemma 2.2. □ 

Figure 2.3 shows £ k (a)l k (b), the basis in a,b. The tensor product in the c direction of these Lagrange 
polynomials with the weighted Jacobi polynomials in Figure 2.1 produces the orthogonal semi-nodal basis 
described in Lemma 2.3. 

Lemma 2.4. The semi-nodal basis defined by fajk for 0 < k < N and 0 < i, j < N — k spans the same 
approximation space as that of the orthonormal rational basis fiijk in Equation (2.1). 
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(a) k = 1 


(b) k = 2 


(c) k = 3 


Fig. 2.3. Polynomials £% (a)£j (b) for N = 3 ; shown with Gauss-Legendre points overlaid. 


Proof. From Proposition 1.8 of [3], the orthonormal rational basis on the bi-unit cube ipijk spans the space 

N 

Qn = ^ Qk(a, b)( 1 - c) fe , 

/c=0 

where Qk(a,b) consists of polynomials of order k in both a and b. Since £ k (a)l k (b) spans Qk(a,b), and 
Qk(ct , b ) D Qk-i(a, b) D ..., we need only to show that 

Span j(~lT~) P N k -k( c )^ k = 0,..., iVj = span |(1 — c) k , k s= 0,..., ivj = P N (c), 

where Pn(c) is the space of polynomials of degree N in c. Since (1 — c/2) k is a polynomial of total 

degree TV, it is automatically contained in P/v(c), and it remains to prove the opposite inclusion. Using 
a counting argument and the fact that Pn has dimension N + 1, we may prove the opposite inclusion by 
showing linear independence of (1 — c) k or equivalently (1 — c) N ~ k P^ Ar+3_2/c (c) for 0 < k < N. 

Since P 2Ar+3_2/c (c) has leading order term c fe , it is sufficient to show that (1 — c) N ~ k c k is linearly 
independent. Using the binomial theorem, we may expand 

(i - C ) N ~ k c k = (-i) i ~ k ^j c i+k . 

The lowest order term in the above sum is c k ; since this term is distinct for each 0 < k < TV, this implies 
linear independence of (1 — c) N ~ k c k . □ 

The existence of an L 2 orthogonal basis on the vertex-mapped pyramid also allows us to characterize 
the spectra of the mass matrix more precisely. 

Corollary 2.5. The minimum and maximum eigenvalues of the mapped mass matrix under the rational 
basis (2.1) are given by 


^min — ^min: ^max — '-'max 

where Jmim T ma x are the minimum and maximum values of the determinant of the Jacobian, evaluated at 
the tensor product (N + l ) 2 -point Gauss-Legendre quadrature on the base of the pyramid. 
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Proof. Let M r be the mass matrix constructing using the rationa basis (2.1), and let M be the mass matrix 
constructed using the semi-nodal basis. Scaling <pij k by w^WjC^_ k results in an orthonormal basis, implying 
that M is diagonal with entries equal to the values of J at quadrature points Ou\,b\. Lemma 2.4 then implies 
that there is a linear change of basis S from the rational basis ifijk to the semi-nodal basis such 
that M r = S~ 1 MS. Since M is diagonal, <fijk are eigenfunctions of the mass matrix with corresponding 
eigenvalues equal to the diagonal entries of M. 

Since J G Qi,i,o on the bi-unit cube, J is bilinear in a, b G [—1, l] 2 . Since a bilinear function increases or 
decreases monotonically in a and 6, the maximum and minimum values of J are attained at the points a^, 6^ 
closest to the boundary of [—1,1] 2 - Since a^, bj are given by the kth order Gauss-Legendre rule for k < 7V + 1, 
and the extremal points of the kth order Gauss-Legendre quadrature approach —1 and 1 monotonically in 
the extremal values of J are attained for k = N + 1. 

We complete the proof by noting that the evaluation of the Jacobian factor J at fixed a^, 6^ is constant 
in c, so we may choose c = — 1, which corresponds to the quadrilateral base of the pyramid. 

□ 


3. Numerical results. We begin by comparing the semi-nodal basis constructed in Lemma 2.3 with 
two low-storage alternatives for mass matrix inversion — Chebyshev iteration [ 12 ] and the Low-Storage 
Curvilinear DG method [25]. We then demonstrate the efficiency of GPU-accelerated discontinuous Galerkin 
methods on vertex-mapped pyramidal elements using the new proposed basis. 

3.1. Comparison with Chebyshev iteration. The Chebyshev iteration constructs an explicit matrix 
polynomial using recurrence relations for Chebyshev polynomials, and has experienced revived attention due 
to the fact that it may be formulated purely in terms of matrix-vector multiplications. This is in contrast 
to Conjugate Gradients, which requires inner products at each iteration. On parallel architectures where 
communication between processes is costly, this necessitates a reduction from a subset of parallel processes 
at each iteration. However, unlike Conjugate Gradients, the Chebyshev iteration requires a-priori knowledge 
of tight bounds on the spectrum of the matrix; poor estimates of the minimum and maximum eigenvalue 
may result in slow or stalled convergence. 

The Chebyshev method has previously been used in the global inversion of continuous Galerkin mass 
matrices by Wathen and Rees [26], where eigenvalue bounds for the mass matrix were derived for linear and 
bilinear elements in 2D. 

Suppose the spectra of the mass matrix is contained in [A m i n , A max ]. Then, for the solution x to Mx = 6 , 
the kth Chebyshev iterate Xk satisfies 


\\x-x k \\ 2 


< 


2 T k 


1 -hr 


2 k 


\\ x ~ x 0\\2 J T = 


1 — Amin/A n 
1 + yj Amin/A n 


This above error bound may also be rearranged to yield 

ll* -!Cfc ll 2 -2(^Tl) l ,a: “ x 0 ll 2 ’ (3 - 1} 

where k is the mass matrix condition number. Wathen observed that, for the symmetric, positive-definite 
mass matrix, the error bound for the Chebyshev iteration in (3.1) is nearly identical to the error bound for 
the Conjugate Gradients method; the only difference between the bound for Conjugate Gradients and (3.1) 
is the norm in which the error converges. 

We consider Chebyshev iteration of the mapped mass matrix using the rational basis ( 2 . 1 ) of Bergot, 
Cohen, and Durufle. The bounds on the maximum and minimum eigenvalues of the matrix are given by 
Corollary 2.5, and are confirmed numerically. We construct a mapped pyramid by warping the quadrilateral 
base with displacement magnitude 7 , as shown in Figure 3.1, along with the residual convergence of the 
Chebyshev iteration for various 7 and the expected rate of convergence given by (3.1), which is observed to 
give an accurate estimate of the residual at each step. No significant change was observed in the convergence 
of the Chebyshev iteration with increasing N ; however, the results show that, even for a modestly warped 
pyramid, the iteration count is greater than 10 , which is unacceptably high for the inversion of the mass 
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Iterations 


(a) Warped pyramid 


(b) Chebyshev residual 


Fig. 3.1. A warped pyramid with 7 = 1 , overlaid with values of the Jacobian at quadrature points for N = 5 (left), and 
the convergence history of the Chebyshev iteration for various 7 (right). The expected rate of convergence is given by (3.1). 


matrix — an 0 ( 10 ) iteration count is relatively low for steady state or time-implicit methods, where a small 
number of time steps are used, but is a high cost for explicit-time discontinous Galerkin methods, which 
require multiple inversions of the mass matrix per time-step over millions of timesteps. 

Various preconditioners for the Chebyshev iteration were tested, with mixed results. We observed that 
the diagonal of the mass matrix was constant under all vertex mappings of the pyramid, which rendered 
a Jacobi preconditioner ineffective. We tested also an incomplete Cholesky factorization with tolerance e, 
which improved the number of iterations needed to reach convergence, but introduced additional memory 
costs for storing Cholesky factors for each element. Additionally, the tolerance e required to achieve a fixed 
number of iterations was observed to depend on the magnitude of the displacement 7 , implying that, for 
fixed e, the effectiveness of incomplete Cholesky as a preconditioner would worsen as the shape regularity of 
the pyramid degrades. For architectures such as GPUs, where device memory is typically 0(10) gigabytes, 
such additional storage costs could decrease the maximum problem size by a large factor. 

3.2. Comparison with LSC-DG. The Low-Storage Curvilinear DG method (LSC-DG) exploits the 
property of DG that local approximation spaces do not need to satisfy explicit conformity conditions. War- 
burton proposed the use of specific basis functions 


4>iJ,y,z) 


i>i{r,s,t) 

VI 


where <f>i is the basis function over the reference element iL, and J is the determinant of the mapping Jacobian 
for the physical element K. As a consequence, the entries of the mass matrix 

Mij = f <pj4>i dx dy dz = f J dr d s d t = f 4>j4>i dr ds dt 

Jk Jk J Jk 


are simply the entries of the mass matrix over the reference element K [24]. 

For isoparametric curvilinear mappings, J is polynomial, implying that <pi is rational. Warburton showed 
that, under a scaling assumption on quasi-regular elements, using such basis functions incurs an additional 
constant in the bounds on the best approximation error between a function u and its weighted projection 
u w u. Given such an element K with size h and Jacobian determinant J, the projection error may be bounded 
as follows 


\\u L[ 7/ ,id 


L*(K) 


< Ch N+1 

1 


\fj 


Vj 




W N + l,oo (JQ 


\ W N + l,2( K j 
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o LSC-DG, N = 1, Gamma = 1 

- e - Semi-nodal, N = 1, Gamma = 1 
—a— LSC-DG, N = 2, Gamma = 1 

- b - Semi-nodal, N = 2, Gamma = 1 



Fig. 3.2. L 2 projection errors for the LSC-DG and semi-nodal orthogonal pyramid bases under increasing N and various 
warpings of the reference pyramid (left), as well as under mesh refinement (right). 


where iHlw^+i’ 00 ^) denotes the L°° Sobolev norm of order N + 1 over K. For comparison, the projection 
error bound for curvilinear mappings using standard mapped bases is 


||it - 


< Ch N+1 

1 


\[j 


Vj 

L°° (K) 



L°°(K) 


IN 


W N + 1 ^( K ) ' 


In other words, accuracy of approximation using rational LSC-DG basis functions comes with stricter re¬ 
quirements on the smoothness of the determinant of the Jacobian J. 

Because the mapping for pyramids with non-parallelogram bases involves factors of (1 —1) _1 , the deriva¬ 
tive of the Jacobian mapping gains higher and higher inverse powers of (1 —t). Since the W°°' N+1 norm of 
vi is ill-defined due to this singularity, the bound on LSC-DG projection error does not hold, and we are 
not guaranteed convergence. To illustrate this, we compare projections of the smooth function 


f(x, y, Z) = cosh (a; + y + z) 

on warped pyramids. L 2 projections using the LSC-DG pyramid basis and the semi-nodal basis of Lemma 2.3 
are computed on both the warped element shown in Figure 3.1 for 7 = .2, .5,1, and on meshes of pyramid 
elements. These meshes are constructed by subdividing the bi-unit cube into K id x K \d x K \d hexahedra, 
where K \d is the number of subdivisions along each edge of the cube. Each hexahedra is then subdivided into 
6 pyramids, and the pyramid vertex positions are perturbed randomly to ensure that the determinant of the 
mapping Jacobian is non-constant. L 2 errors are computed on each mesh at various orders of approximation 
N. Figure 3.2 shows the L 2 error for each basis under refinement in both h and TV, with the convergence 
of the LSC-DG error stalling under refinement in each case. We note that Bergot, Cohen, and Durufle also 
observed that error stalled under mesh refinement for fixed order TV. The shape regularity of the pyramid 
(which is controlled by 7) also affects the approximation error, but only by a constant factor. 

4. Efficient discontinuous Galerkin methods on pyramids. Finally, to ascertain the effectiveness 
of the semi-nodal basis for discontinuous Galerkin methods, we examine numerical solutions of the advection 
equation and the acoustic wave equation using time-explicit DG. 

4.1. Advection equation. We consider the advection equation on a bi-unit cube [— 1 , 1] 3 with periodic 
boundary conditions 

3 v 

_ + V-(/3w) = 0, 

where f3 is a vector indicating direction of advection. We assume a mesh consisting purely of pyramidal 
elements K. For each face of iV, we refer to the outward normal as n. Let K E Qh denote a specific element, 
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and let u denote the trace of the solution u and a test function v, respectively, on a face. We may then 
define the jump {uj and average {[?/]} over a face as 

[«] = «--«+ M = ^. 

Let (3 n = {3 • n. Then, the semidiscrete discontinuous Galerkin formulation of the advection equation is 
given locally as 

j K v ~ (jti + v ' ^“0 Ax + J dK (~— v ~ M Ax = °> 

where a is a parameter. For a = 1, an upwind numerical flux is recovered, while for a = 0, a central flux is 
recovered [14]. The formulation is then discretized by representing u,v using the semi-nodal basis 
defined in Lemma 2.3. This results in a system of ODEs 

^ + M~ l ^ S k u + L f F 1 j = 0, 

where Lf, S k are the lift operator and weak derivative matrix defined by 


, N l 

L lj = / dx ~ y 

JdK l=1 

Sij = j dx « wi(pi(xi)f3k(xi) 


i=i 


d<t>j(xi) 

dx k 


and Fi is the flux at the quadrature point xi. N c and N[ denote the number of quadrature points for volume 
and surface integrals, respectively. For constant advection, N c and N[ are taken to be the minimum number 
of quadrature points required to integrate the mass matrix exactly . 1 We adopt this minimial quadrature 
rule, which is defined on the bi-unit cube using a (N + l ) 2 point tensor product Gauss-Legendre quadrature 
in the a, b coordinates and an (N + 1 ) point Gauss-Jacobi quadrature with weights ( 2 , 0) in the c direction. 
The resulting points are then mapped using the Duffy-type transform to the bi-unit pyramid. We note that 
the integrals of derivatives of rational basis functions on the pyramid may be computed exactly using the 
minimal quadrature rule due to cancellation of the rational Jacobian factors with change of variables factors 
for the derivative. For a G [0,1] and periodic boundary conditions, the DG formulation is energy stable [14]. 

The resulting system of ODEs may then be solved in time using a method of lines discretization, such 
as low-storage 4th order Runge-Kutta [ 6 ]. For DG on affine-mapped simplicial elements, each physical mass 
matrix is a constant scaling of the reference mass matrix, and M -1 may be precomputed on the reference 
element and premultiplied with the lift and weak derivative matrices. For DG on pyramids, the mass matrix 
differs on each element, but is diagonal under the semi-nodal pyramid basis. Thus, instead of precomputing 
individual operators for each element, we precompute the diagonal factors of M -1 and apply them at each 
timestep. 


4.2. GPU acceleration. Typical GPU-accelerated implementations break up the solution of the sys¬ 
tem of ODEs resulting from the DG discretization into three steps: computation of volume integrals, surface 
integrals, and a Runge-Kutta update step, which are performed by VolumeKernel, Surf aceKernel, and 
UpdateKernel, respectively: 


du -i 

dt +M 

UpdateKernel 


\ 

^2 S k u + L^F 

t :=1 ^ Surf aceKernel 

VVolumeKernel / 


= 0 . 


1 Non-constant advection is treated identically, though it may be beneficial to increase the number of quadrature points in 
order to offset under-integration (aliasing) effects. 
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This implementation differs slightly from simpler GPU-accelerated implementations of DG methods, in 
that UpdateKernel, in addition to advancing forward in time, applies the inverse of the mass matrix and 
interpolates the solution to surface cubature points. Work is partitioned such that elements (or batches of 
elements) are assigned to independent work-groups, while each work-item/thread processes work for either 
a single basis function or cubature node. 

We consider first, for simplicity of presentation, a purely modal DG method for the pure convection 
equation with (3 = [ 1 , 0, 0] T , and outline the approach used to implement a solver on the GPU. The resulting 
system of equations for this specific constant advection problem is then 

^7 + M ~ X (S x u + L f F) = 0, L{j = [ n x 4>i(x)<fij(x)dx, x)^p-dx. 

at j dK j k ox 


Algorithm 1 Computation of volume integrals, 
l: procedure Volume kernel 

2 : Compute derivatives at cubature points Xi for i = 1 ,..., N c . 


du(xj) 

dx 


N p 

E 

3 = 1 


DljUj 


drjxj ) 
dx 


DtjUj 


ds(xj) 

dx 


D ij u i 


dx ) ' 


3: Scale by premultiplied values of WiJi , compute integral by multiplying by V T . 



N c 


= E 


VfiWiJi 




du(xj ) 
dx 


4.2.1. Volume kernel. The computation of volume integrals requires evaluation of solution values at 
cubature nodes and computation of quadrature sums. We assume, for the reference pyramid, N c volume 
cubature points r*, s*, U and weights Wi. Let V represent the volume Vandermonde matrix, and let D r , D 3 , D l 
represent the derivative matrices with respect to reference coordinates r, 5 , t: 


Vij — tf)j {vi , Si , ), 


D r = 
u 


(b) &i 5 U) 
dr 


D s = 


(h) fi) 
~~ds ’ 


D l = 
u 


d(j)j (ri,Si,tj) 

dt 


We store the above N c x N p matrices, as well as U T , only for the reference element. By storing geometric 

dr dr dr 

change-of-variables factors and determinants I JA of Jacobian mappings at the N c volume 

dx dy dz 

cubature points at each element, we may compute the integral 



da: dy dz 


f . , f du dr du ds du dt\ 

+ -q- s q^ + ft fa) 


J dr ds dt 


using V, D r , D s D t and the Jacobian factors premultiplied by quadrature weights WiJi , as described in Algo¬ 
rithm 1 . 

4.2.2. Surface Kernel. We assume N[ total surface cubature points (over all the faces of the pyramid) 
r {, s {, t{ with surface cubature weights , and we define Vf as the surface Vandermonde matrix V/j = 

(/)j (rf, Storing normals n xp ,n yp , n zp and determinants of Jacobian mappings j/ at surface cubature 

points x{, we may compute surface integrals of the flux F 

(L f F)i= [ fcFdx, F = (n x - a\n x \) M , 

JdK 

as described in Algorithm 2 . This approach differs slightly from that of standard nodal DG algorithms 
in that it does not loop over faces, but computes over all cubature points on the surface of a pyramid at 


ll 













once. This is due to the inhomogeneous nature of the faces on a pyramid — while it is possible to use 
low-memory techniques (discussed in more detail in Section 4.4) to compute surface integrals on triangular 
and quadrilateral faces, they require differentiation between the types of faces within a kernel, or separate 
kernels for triangular and quadrilateral faces. 


Algorithm 2 Computation of surface integrals, 
l: procedure Surface kernel 

2: Compute flux F m ( n x — a \n x \) [r] at face cubature points, scale by surface Jacobian factors and 

weights 

3: Compute integral by multiplying by (V^) T 


n N c 

(L f F)i = / frFiuf) = £( V%w{ J{ Fi. 
JdK i=1 


4.2.3. Update Kernel. Given the diagonal entries of the mass matrix, the timestep dt, and the kth 
step RK coefficients rj, r^, the update kernel inverts the mass matrix, performs both a Runge-Kutta substep 
to march the solution to the next time, and interpolates the new solution to surface cubature nodes for use 
in the next surface kernel, as shown in Algorithm 3. 


Algorithm 3 Runge-Kutta update step with added interpolation to face cubature points, 
l: procedure Runge-Kutta update step 
2 : Compute right-hand side 


where bi is the sum of volume and surface integrals. 

3: Update residual r and local solution at the fcth RK step 

n = r^Ti + dtb k , u k =Ui + r k Ti. 

4: Interpolate local solution to face cubature points using the face Vandermonde matrix V?. 

u f ’ k = V f u k . 


4.2.4. Kernel optimization. We attempted to optimize the above kernels by minimizing the number 
of memory accesses, minimizing non-unit strided memory accesses, maximizing the speed at which data is 
accessed, or hiding the effect of latency in accessing data. 

When data must be accessed repeatedly, we take advantage of the GPU memory hierarchy. Data that 
is used repeatedly within a workgroup is loaded to shared memory, and data used repeatedly in threads is 
loaded to register memory, both of which allow for fast data retrieval . * 2 Likewise, since register memory is 
limited, multiple inputs are concatenated into strided arrays in order to decrease register pressure. 

Finally, we may hide the latency present in memory accesses by exploiting work which may be done 
concurrently. For example, we may prefetch values (such as geometric factors in the computation of volume 


2 Since the shared memory available on GPUs is limited, using a large amount shared memory in a kernel will reduce the 
number of concurrent active work groups, so we do not load derivative matrices and interpolation operators to shared memory 
due to their large number of entries in 3D. These matrix reads are still relatively efficient due to coalescing and caching effects. 
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AMD 

AMD Tahiti + OpenCL 

Nvidia 

Nvidia GeForce GTX 980 + CUDA 

CPU 

Intel Core i7-5960X + OpenMP 


Table 4.1 

Legend of abbreviations for different computational architectures. 



(a) GFLOPS (b) Est. bandwidth 


Fig. 4.1. GFLOPS and estimated effective bandwidth for the advection volume kernel under both the AMD and Nvidia 
setup. 


integrals) before executing other independent commands. Additionally, prefetching may facilitate additional 
compiler optimizations; though the volume kernel achieves roughly the same performance with and without 
prefetching under CUDA, the volume kernel with prefetching achieves an extra 10-25 GFLOPS when running 
under OpenCL. 

To assess the computational performance of the semi-nodal pyramid basis, we implemented a GPU- 
accelerated DG solver using the OCCA scripting language [19]. The solver is written using OCCA kernels, 
which may then be expanded to various threading languages for portability across differing architectures. 
Numerical experiments suggest that OCCA kernels, translated into CUDA, OpenCL, or OpenMP, perform 
nearly as well as hand-tuned kernels written directly in the native language [10]. The DG solver is run for a 
fixed number of timesteps on an Nvidia GeForce GTX 980 using CUDA (which we abbreviate as “Nvidia”) 
and an AMD Tahiti GPU using OpenCL (which we abbreviate as “AMD”). Additionally, we ran the same 
kernels up to N = 4 on an Intel Core i7-5960X CPU using OpenMP (which we abbreviate as “CPU”). The 
GFLOPS and estimated effective bandwidth of each kernel are reported in Figure 4.4. We note that the 
effective bandwidth estimates do not consider caching effects; as a result, the reported numbers may exceed 
the maximum available device bandwidth. Table 4.1 gives a legend of abbreviations and their respective 
computational platforms. 

The mesh is taken to be a 16 x 16 x 16 mesh of hexahedral elements, each of which is then subdivided 
into 6 pyramids to produce 24576 elements. The order is varied from N = 1 to N = 6 (order is limited to 
N = 5 when using OpenCL, due to the memory limitations on workgroup size), and both GFLOPS and 
estimated effective bandwidth (averaged over three runs) are reported for the volume and surface kernels in 
Figures 4.1, 4.2, and 4.3. 

We note that the effect of caching is relatively significant in computing estimated bandwidth. If the 
bandwidth is estimated without counting operator loads, we get 
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N N 

(a) GFLOPS (b) Est. bandwidth 

Fig. 4.2. GFLOPS and estimated effective bandwidth for the advection surface kernel under both the AMD and Nvidia 
setup. 



(a) GFLOPS (b) Est. bandwidth 


Fig. 4.3. GFLOPS and estimated effective bandwidth for the advection RK update kernel under both the AMD and Nvidia 
setup. 


4.3. Acoustic wave equation. We consider also the acoustic wave equation on domain Q with free 
surface boundary conditions p = 0 on dCl. This may be written in first order form 


1 dp 

-T77 + V -U = f 

K Ot 

Out 

p ~8i +Vp = °’ 


where p is acoustic pressure, u is velocity, and p and k are density and bulk modulus, respectively, and are 
assumed to be piecewise constant. 

Let (p~ ,u~) denote the solution fields on the face of an element A, and let (p + ,u + ) denote the solu¬ 
tion on the neighboring element adjacent to that face. Defining the jump of p and the vector velocity u 
componentwise 


[u] = u + — u 
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[pj = p + - p . 



















(a) GFLOPS (b) Est. bandwidth 


Fig. 4.4. 


Gflops and estimated effective bandwidth for advection volume, surface, and update kernels under the CPU setup. 


the semi-discrete variational formulation for the discontinuous Galerkin method may then be given over an 
element as 

/ (k + V “) V ~ da; + jg ^ ' M “ T P bl) V~ dx = J fv~ da; 

Jk (p^j; + Vp) v~ + nf (|p] - T u n • [«]) v~ da; = 0, 

where r p = 1/ {[pc]}, r u = {[pc]}, and c 2 = k/ p is the speed of sound. 

We discretize by again representing u,v using the semi-nodal basis fajk- This converts the variational 
problem into a system of ODEs 

\k =1 

+ pM- 1 ( S k u k + U< k P Uk ) =0, k = 1,.... 3, 
where Z/ is the scalar lift operator, and L kk is the vector lift operator defined by 



. Nf 

f = / n k <pi(x)(l>j(x) da; « 

J 9K I=1 


applied to the penalty terms P p , P u ., and S k is the weak derivative matrix defined by 


QK _ 

£>ij — 


= L^ Mx)ax 


N c 

E 

1=0 


Wi- 


dx k 


M x i ) 


for an appropriate N c point quadrature rule with points xi and weights wi. 

Defining a vector variable U = (p, n), we may write our system of ODEs for the wave equation as 


d P=AU. 

dt 


The computed spectral radii of the RHS matrix p(A) are given in Figure 4.5 for various mesh sizes h 
(computed as the ratio of surface area to volume of an element) and a function of the order of approximation 


15 























Fig. 4.5. Ratio of numerically computed spectral radii p{A), plotted against both mesh size h and 2(N + 1 )(N + 3)/3. 


TV. The spectral radius p( A) gives an estimate of the maximum timestep under which an explicit scheme 
remains stable, and for standard polynomial finite element spaces is proportional to N 2 /h. We observe the 
same behavior numerically for pyramids, and note that the spectral radius shows very good agreement with 
2 (TV + 1)(TV + 3)/3, which is the TV-dependent constant in the discrete trace inequality for the pyramid [8]. 3 

We also report numerical convergence rates in Figure 4.6 for the resonant cavity solution 

p(x, y , z, t) = cos (ttx/2) cos (Try/2) cos {rrz/2) cos . 

over the bi-unit cube [— 1,1] 3 - Meshes are again constructed by subdividing the cube into Kw x I<ld X /Go 
hexahedra, which are then each subdivided into 6 pyramids. Pyramid vertex positions are perturbed to 
ensure J is non-const ant in each element. 

It was confirmed in [3] that the rational basis (2.1) achieves optimal 0(TV + 1) rates of convergence for 
both the L 2 and dispersion error. Since the basis defined by spans the same approximation space as that 
of (2.1), the numerical errors and convergence rates are also of optimal order, and we observe both optimal 
rates of convergence in h and exponential convergence in TV. The errors are computed in double precision on 
the GPU; when using single precision, the convergence rates behave similarly, but L 2 errors stall at around 
10 -6 due to finite precision effects. 

We present GFLOPS and estimated effective bandwidth for the volume, surface, and RK update kernel 
in Figures 4.7, 4.8, and 4.9. While the estimated effective bandwidth for acoustic wave kernels is similar to 
that of the advection kernels, the GFLOPS have increased by a factor of 2-4, due to the reuse of derivative 
and interpolation operators over multiple field variables. This may be further confirmed by examining 
the estimated effective bandwidth without counting operator loads, in which case the reported bandwidth 
decreases by roughly an order of magnitude. 

Figure 4.10 shows GFLOPS and estimated effective bandwidth on an Intel Core i7-5960X CPU using 
OpenMP. Again, the GFLOPS of acoustic wave kernels increase while the estimated effective bandwidth 
remains roughly the same as that of the advection kernels, though the increase is not as pronounced as the 
increase in GFLOPS from advection to the acoustic wave equation on the GPU. 

4.4. Computational improvements. Despite the reported GFLOPS and estimated effective band¬ 
width reported for pyramids above, it is possible to improve the efficiency of DG on pyramids further. 


3 Though the meshes used to compute the spectral radii are uniform, randomly perturbing the vertex positions does not 
change the value of p( A) significantly, which was also observed in [3]. 


16 

















Fig. 4.6. Computed L 2 errors for various orders N and mesh sizes, with optimal rates of h-convergence for reference. 



(a) GFLOPS (b) Est. bandwidth 


Fig. 4.7. GFLOPS and estimated effective bandwidth for the wave volume kernel under both the AMD and Nvidia setup. 


4.4.1. Nodal basis functions. In the above discussions, we discretize by taking the orthogonal 
(modal) basis <pijk defined in Lemma 2.3). However, switching to a nodal discretization using Lagrange 
basis functions at N p distinct points on the pyramid requires only a small modification in the application 
of the mass matrix inverse. Instead of inverting a diagonal matrix, may be inverted by a change of basis 
from a nodal to the orthogonal semi-nodal basis. It is often desirable to define nodal basis functions at 
strong interpolation points with respect to the Lebesgue constant, which is present in upper bounds on the 
interpolation error in the maximum norm. Additionally, for meshes containing multiple element types, it is 
convenient to choose nodal points on the triangular faces with (N + 1)(7V + 2)/2 points on the triangular 
faces and (N + l) 2 points on the quadrilateral face such that the distribution on those faces matches the 
distribution on the faces of either hexahedra or tetrahedra. A survey of various nodal points for the pyramid 
with both low Lebesgue constant and appropriate nodal distributions on faces is given in [7]. 

If the triangular and quadrilateral faces of all elements in a mesh share the same symmetric nodal 
distribution, conformity under continuous Galerkin methods may be enforced simply by matching the nodal 
degrees of freedom on the faces of adjacent elements, and the computation of surface integrals may be 
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(a) GFLOPS (b) Est. bandwidth 

Fig. 4.8. GFLOPS and estimated effective bandwidth for the wave surface kernel under both the AMD and Nvidia setup. 



(a) GFLOPS 


(b) Est. bandwidth 


Fig. 4.9. GFLOPS and estimated effective bandwidth for the wave RK update kernel under both the AMD and Nvidia setup. 


simplified. In particular, for discontinuous Galerkin methods on vertex-mapped pyramids, surface integrals 
may be computed using only nodal degrees of freedom on a face and face mass matrices. For triangular faces, 
the face mass matrices are scalings of the reference nodal face mass matrix, and the quadrilateral face mass 
matrix may be expressed as the Kronecker product of separable scalings of ID nodal mass matrices. As a 
result, the use of nodal mass matrices is more efficient and requires less memory than the use of quadrature 
for the computation of surface integrals. Since the computation of surface integrals is the dominant cost 
for low order discontinuous Galerkin methods, nodal methods are often observed to be more efficient for 
moderate values of N [4, 14]. For larger values of TV, the ratio of interior degrees of freedom to surface 
degrees of freedom increases, and the cost of computing volume integrals becomes dominant. 

4.4.2. Volume kernel evaluation. While the computation of surface integrals may also be performed 
using mass matrices under a nodal basis, volume integrals must still be computed using quadrature due to 
the the rational nature of the mapping. Despite the reported GFLOPS and estimated effective bandwidth, 
the cost of the volume kernel becomes a limiting factor at high orders. Figure 4.11 shows reported runtimes 
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(a) GFLOPS (b) Est. bandwidth 


Fig. 4.10. Gflops and estimated effective bandwidth for wave volume, surface, and update kernels under the CPU setup. 




(a) Kernel runtimes 


(b) Percentage of total runtime 


Fig. 4.11. Runtimes and average percentage of total runtime for each individual kernel. 


and percentage of total runtimes 4 for volume, surface, and update kernels on an Nvidia GeForce GTX 980 
over 100 timesteps; at N > 2, the volume kernel becomes the dominating bottleneck due to use of tensor 
product cubature rules for the pyramid, which results in an 0 (N 6 ) cost in applying derivative operators, in 
contrast to 0 (N 4 ) cost of surface cubature and applying interpolation operators in the surface and update 
kernels, respectively. 

This cost may be alleviated by exploiting the tensor-product nature of the orthogonal pyramid basis 
and quadrature rule. This was done by Bergot, Cohen, and Durufle in [4] to yield lower-cost evaluations of 
volume integrals for the non-orthogonal pyramid basis. It may also be possible to decrease memory costs 
and improve efficiency by adopting quadrature rules constructed directly on the pyramid instead of mapping 
quadrature rules from the bi-unit cube to the pyramid, which typically involve a fewer number of points 
than the (N + l) 3 -point rules currently used [18, 27]. We hope to explore these options in future work. 


4 The percentage of total runtimes are averages of the percentage of total runtime for advection kernels and percentage of 
total runtime for wave kernels. 
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5. Conclusions and acknowledgements. We have presented a new higher order basis which is or¬ 
thogonal on vertex-mapped transformations of the reference pyramid, despite the fact that the transformation 
is non-affine. This allows for low-storage implementations of discontinuous Galerkin methods on pyramids, 
which we hope will aid efficient GPU implementations on hex-dominant meshes. 
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